aggregation method
- North America > United States > Colorado (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.70)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Italy > Tuscany > Florence (0.04)
- Asia > China > Hong Kong (0.04)
- (14 more...)
When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models
Razafindralambo, Raphaël, Sun, Rémy, Precioso, Frédéric, Garreau, Damien, Mattei, Pierre-Alexandre
Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules using Deep Ensembles, Monte Carlo Dropout, on CIF AR-10 and FFHQ. We attempt to explain this discrepancy by investigating possible explanations, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g.
- North America > United States (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Europe > France > Provence-Alpes-Côte d'Azur > Alpes-Maritimes > Nice (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
Symmetric Aggregation of Conformity Scores for Efficient Uncertainty Sets
Alami, Nabil, Zakharia, Jad, Taieb, Souhaib Ben
Access to multiple predictive models trained for the same task, whether in regression or classification, is increasingly common in many applications. Aggregating their predictive uncertainties to produce reliable and efficient uncertainty quantification is therefore a critical but still underexplored challenge, especially within the framework of conformal prediction (CP). While CP methods can generate individual prediction sets from each model, combining them into a single, more informative set remains a challenging problem. To address this, we propose SACP (Symmetric Aggregated Con-formal Prediction), a novel method that aggregates nonconformity scores from multiple predictors. SACP transforms these scores into e-values and combines them using any symmetric aggregation function. This flexible design enables a robust, data-driven framework for selecting aggregation strategies that yield sharper prediction sets. We also provide theoretical insights that help justify the validity and performance of the SACP approach. Extensive experiments on diverse datasets show that SACP consistently improves efficiency and often outperforms state-of-the-art model aggregation baselines.
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.54)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge
Dadkhahi, Hamid, Trabelsi, Firas, Riley, Parker, Juraska, Juraj, Mirzazadeh, Mehdi
Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation. Thinking large language models (LLMs) are increasingly being employed as automated judges for evaluating the output of other generative systems, a paradigm known as "Thinking-LLM-as-a-Judge" (Saha et al., 2025). This approach offers a scalable and cost-effective alternative to human evaluation, which is often slow and expensive. To mitigate the inherent stochasticity and noise of single-pass judgments, a common strategy is to leverage inference-time compute (ITC) Snell et al. (2024) by generating multiple independent reasoning and rating samples for each item being evaluated. However, the reliability of the final judgment hinges critically on how these multiple outputs are aggregated. Current aggregation methods, such as majority voting (Self-Consistency (Wang et al., 2023b)) or heuristics based on model confidence scores or LLM generated aggregators, are often brittle and statistically suboptimal.
- Europe > United Kingdom > England > Greater London > London (0.04)
- Asia > Singapore (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Eliciting Categorical Data for Optimal Aggregation
Chien-Ju Ho, Rafael Frongillo, Yiling Chen
Models for collecting and aggregating categorical data on crowdsourcing platforms typically fall into two broad categories: those assuming agents honest and consistent but with heterogeneous error rates, and those assuming agents strategic and seek to maximize their expected reward. The former often leads to tractable aggregation of elicited data, while the latter usually focuses on optimal elicitation and does not consider aggregation. In this paper, we develop a Bayesian model, wherein agents have differing quality of information, but also respond to incentives. Our model generalizes both categories and enables the joint exploration of optimal elicitation and aggregation. This model enables our exploration, both analytically and experimentally, of optimal aggregation of categorical data and optimal multiple-choice interface design.
- North America > United States > Colorado (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.70)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)
Power Ensemble Aggregation for Improved Extreme Event AI Prediction
Collard, Julien, Gentine, Pierre, Zheng, Tian
This paper addresses the critical challenge of improving predictions of climate extreme events, specifically heat waves, using machine learning methods. Our work is framed as a classification problem in which we try to predict whether surface air temperature will exceed its q-th local quantile within a specified timeframe. Our key finding is that aggregating ensemble predictions using a power mean significantly enhances the classifier's performance. By making a machine-learning based weather forecasting model generative and applying this non-linear aggregation method, we achieve better accuracy in predicting extreme heat events than with the typical mean prediction from the same model. Our power aggregation method shows promise and adaptability, as its optimal performance varies with the quantile threshold chosen, demonstrating increased effectiveness for higher extremes prediction.
- North America > United States (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > France (0.04)
Reasoning Planning for Language Models
Nguyen, Bao, Nguyen, Hieu Trung, She, Ruifeng, Fu, Xiaojin, Nguyen, Viet Anh
Selecting an appropriate reasoning method for a given query remains a key challenge in language model generation. Existing approaches typically generate multiple candidate responses and use an aggregation strategy to select the output answer, often assuming that more candidate answers yield higher accuracy. We revisit this assumption through a rigorous theoretical analysis, deriving accuracy bounds for standard aggregation methods under fixed generation distributions and candidate sizes. Building on these insights, we introduce EPIC, an Ensemble Planning with Contrastive learning framework to learn a shared representation space that captures both model reasoning abilities and query-method compatibility. EPIC incorporates our probability bounds as a regularizer in a utility-driven optimization that balances accuracy and computational cost. Experiments on diverse mathematical reasoning tasks show that EPIC consistently selects optimal reasoning methods, improving accuracy while reducing computational overhead. Our code can be found at https://github.com/nguyenngocbaocmt02/EPIC.
- Europe > Austria > Vienna (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.68)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
On the Fragility of Contribution Score Computation in Federated Learning
Pejo, Balazs, Frank, Marcell, Varga, Krisztian, Veliczky, Peter, Biczok, Gergely
This paper investigates the fragility of contribution evaluation in federated learning, a critical mechanism for ensuring fairness and incentivizing participation. We argue that contribution scores are susceptible to significant distortions from two fundamental perspectives: architectural sensitivity and intentional manipulation. First, we explore how different model aggregation methods impact these scores. While most research assumes a basic averaging approach, we demonstrate that advanced techniques, including those designed to handle unreliable or diverse clients, can unintentionally yet significantly alter the final scores. Second, we explore vulnerabilities posed by poisoning attacks, where malicious participants strategically manipulate their model updates to inflate their own contribution scores or reduce the importance of other participants. Through extensive experiments across diverse datasets and model architectures, implemented within the Flower framework, we rigorously show that both the choice of aggregation method and the presence of attackers are potent vectors for distorting contribution scores, highlighting a critical need for more robust evaluation schemes.
- Europe > Hungary > Budapest > Budapest (0.05)
- North America > United States > Virginia (0.04)
- Europe > United Kingdom > England > Surrey > Guildford (0.04)
- Overview (1.00)
- Research Report > New Finding (0.68)
Predicting symbolic ODEs from multiple trajectories
Şahin, Yakup Emre, Kilbertus, Niki, Becker, Sören
We introduce MIO, a transformer-based model for inferring symbolic ordinary differential equations (ODEs) from multiple observed trajectories of a dynamical system. By combining multiple instance learning with transformer-based symbolic regression, the model effectively leverages repeated observations of the same system to learn more generalizable representations of the underlying dynamics. We investigate different instance aggregation strategies and show that even simple mean aggregation can substantially boost performance. MIO is evaluated on systems ranging from one to four dimensions and under varying noise levels, consistently outperforming existing baselines.